As silicon technology migrates from 0.18 micron to 0.13 micron and below, it has become possible to integrate a large number of processors and peripheral devices and more than 1 Mbyte of SRAM on a single chip. System-on-chip design thus becomes more attractive for a wide range of high-performance, low-power digital signal processing applications that are traditionally implemented using multiple silicon chips on a board. The SoC approach is necessary not only for small-form-factor, handheld devices but also for telecom central-office applications, where ultrahigh channel density per square inch and board and power limitations are the most important factors.
SoC design poses new challenges to chip designers in system architecture definition, logic design, system intellectual-property integration, functional verification, place and route, testing, system software development and debug. It requires new design methodologies and tools and a different skill set than traditional ASIC chip design. The following case study illustrates the new challenges and describes the methodologies 3DSP is using to address them.
In a telecom central-office access gateway application, the DSP subsystem needs to support a wide variety of media access standards and modem protocols at an extremely high channel density (OC-3, OC-12, OC-48 and beyond). The end system also has strict form-factor restrictions and power limitations.
In a traditional design methodology, the system designer would come up with a preliminary system requirement document, search for standard DSP chip sets that satisfy the requirement and then build the board to fit the chip set. Because the system design is determined by the chip set, which is designed by chip designers who may not be familiar with end systems, the end system often is not optimized for the application.
We took a different approach to SoC design. The architecture definition started from the system board design, and the system designers decided what the chip specification should be. For the application in question, we determined from system analysis that eight DSPs per chip would be optimal for the board design. Because the chip would support different communication standards that had different Mips and memory requirements, we carefully analyzed external memory bandwidth needs, since bandwidth rather than DSP processing power is the primary bottleneck for certain applications.
Embedded DRAM was considered, but the chip designers decided the potential yield risk was too high. Double-data-rate (DDR) DRAM was chosen instead of SDRAM for the external memory interface in order to balance the system memory bandwidth requirement and DSP processing power.
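As a back-of-envelope illustration of this kind of bandwidth balancing, the sketch below compares the peak bandwidth of a single-data-rate and a double-data-rate interface against an aggregate DSP requirement. The clock rate, bus width and per-DSP traffic are hypothetical values chosen for illustration; only the count of eight DSPs comes from the design described here.

```python
def peak_bandwidth_mbytes(clock_mhz, bus_bits, transfers_per_clock):
    """Peak interface bandwidth in Mbytes/s (10^6 bytes per second)."""
    return clock_mhz * transfers_per_clock * bus_bits / 8


def required_bandwidth_mbytes(num_dsps, per_dsp_mbytes):
    """Aggregate external-memory traffic generated by all DSPs."""
    return num_dsps * per_dsp_mbytes


# Hypothetical interface parameters (not taken from the article).
CLOCK_MHZ = 133
BUS_BITS = 32
PER_DSP_MBYTES = 100   # assumed external traffic per DSP
NUM_DSPS = 8           # eight DSPs per chip, as in the case study

sdr = peak_bandwidth_mbytes(CLOCK_MHZ, BUS_BITS, transfers_per_clock=1)
ddr = peak_bandwidth_mbytes(CLOCK_MHZ, BUS_BITS, transfers_per_clock=2)
need = required_bandwidth_mbytes(NUM_DSPS, PER_DSP_MBYTES)

print(f"SDR peak {sdr:.0f}, DDR peak {ddr:.0f}, required {need:.0f} Mbytes/s")
```

With these assumed numbers the single-data-rate interface falls short of the requirement while the double-data-rate interface covers it, which is the shape of the trade-off rather than the actual figures.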
A number of other peripheral devices were also chosen based on the end system requirement. Latency, efficiency and bandwidth requirements were studied carefully to avoid any potential I/O bandwidth issues.
After the peripheral devices and DSPs were chosen, the system chip designers looked at the on-chip system integration. 3DSP provides the configurable DSP Shuttle bus as a system-integration bus for DSPs, microprocessors and peripheral devices.
Because of the large memory bandwidth requirement, very high channel density and latency requirements, DSP Shuttle bus bandwidth and efficiency were carefully analyzed. We first looked at the maximum bus speed. That was decided not by the logic gate delay but by the wire RC delay, which has become an important factor in the speed of large deep-submicron designs.
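To see why wire delay dominates on long on-chip buses, a first-order estimate is useful. The sketch below uses the standard distributed-RC approximation, in which the 50 percent delay of an unrepeated wire is about 0.38 times its total resistance times its total capacitance. The per-millimeter resistance and capacitance are hypothetical values, not process data from this design.

```python
def wire_delay_ns(r_ohm_per_mm, c_pf_per_mm, length_mm):
    """Approximate 50% delay of an unrepeated distributed RC wire.

    Uses t ~= 0.38 * R_total * C_total. Ohms times picofarads gives
    picoseconds, so divide by 1000 to report nanoseconds.
    """
    r_total = r_ohm_per_mm * length_mm
    c_total = c_pf_per_mm * length_mm
    return 0.38 * r_total * c_total / 1000.0


# Hypothetical wire parameters for illustration only.
R_PER_MM = 100.0   # ohms per mm
C_PER_MM = 0.2     # pF per mm

for length in (5, 10):
    print(f"{length} mm wire: {wire_delay_ns(R_PER_MM, C_PER_MM, length):.2f} ns")
```

Because both resistance and capacitance grow with length, the delay grows with the square of the wire length: doubling the run quadruples the delay, independent of how fast the gates at either end are.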
After determining the bus speed we studied the bus system efficiency. The DSP Shuttle can be viewed as a very complicated system switch. The DMA bus controller, port buffers, external I/O bandwidth, latency, burst size and system data traffic pattern will determine the efficiency. Different local data buffers were placed using our HiFI (Highly Flexible Integrated DSP) SoC development system to optimize the efficiency of the bus. For the DDR DRAM interface, additional buffers were added to improve the I/O bandwidth.
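The effect of burst size on bus efficiency can be illustrated with a simple model. The sketch below assumes one data word per bus cycle and a fixed per-transfer overhead (arbitration, addressing, turnaround); the overhead of four cycles is a hypothetical figure, and the model ignores the traffic-pattern and buffering effects that a real analysis would include.

```python
def bus_efficiency(burst_words, overhead_cycles):
    """Fraction of bus cycles that carry data for one burst transfer."""
    return burst_words / (burst_words + overhead_cycles)


OVERHEAD_CYCLES = 4  # assumed fixed cost per transfer, for illustration

for burst in (4, 16, 64):
    eff = bus_efficiency(burst, OVERHEAD_CYCLES)
    print(f"burst of {burst:2d} words: {eff:.0%} efficient")
```

Longer bursts amortize the fixed overhead, which is one reason local buffers help: they let the bus move data in larger blocks.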
Software partitioning
The hardware system requirement is only one driver. System software design also will heavily influence the chip design.
In our application, the SoC chip needs to process a very large number of channels in a multiprocessor environment. The channels do not consume much Mips or memory, but we require the system to be scalable and to be able to switch and run different communication algorithms simultaneously. The system designers decided to use symmetrical multiprocessing, wherein each processor is independent. The advantage is that the system is highly scalable and easy to manage. A system manager can have one processor run one DSP communication algorithm while another processor runs a completely different algorithm. The system load can also be dynamically configured and changed on the fly.
We used 3DSP's proprietary RTOS, Speedi, combined with the DSP Shuttle, to handle scheduling and data preparation for the 500-plus tasks.
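To make the load-distribution idea concrete, the sketch below assigns channels to independent processors with a simple least-loaded policy. This is an illustrative stand-in, not Speedi's scheduling algorithm; the function name and the per-channel Mips figures are hypothetical.

```python
import heapq


def assign_channels(channel_mips, num_dsps):
    """Give each channel to whichever DSP currently has the lowest load.

    Returns (assignment, loads): a list of channel indices per DSP and
    the resulting Mips load per DSP.
    """
    heap = [(0.0, dsp) for dsp in range(num_dsps)]  # (load, dsp index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_dsps)]
    loads = [0.0] * num_dsps
    for channel, mips in enumerate(channel_mips):
        load, dsp = heapq.heappop(heap)
        assignment[dsp].append(channel)
        loads[dsp] = load + mips
        heapq.heappush(heap, (loads[dsp], dsp))
    return assignment, loads


# Sixteen hypothetical channels at 10 Mips each, spread over eight DSPs.
assignment, loads = assign_channels([10.0] * 16, num_dsps=8)
print(loads)
```

Because every processor is independent, a manager can rerun the same assignment whenever channels are added or dropped, which is what makes the load easy to reconfigure on the fly.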
The software system analysis also decides the DSP core's on-chip program, data memory and peripheral bandwidth requirements. In this design, because the memory occupies more than 70 percent of the chip area, software system development was started at a very early stage to reduce the program and data memory requirement by using memory swapping. However, memory swapping increases the system bandwidth requirement. The software programmers and hardware designers worked very closely to make the right trade-offs to reduce the chip cost.
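The trade-off between on-chip memory and bandwidth can be put in rough numbers. The sketch below estimates the on-chip memory saved by keeping only one swappable block resident at a time and the external traffic the swapping adds. All sizes and rates are hypothetical; only the count of eight DSPs comes from the design.

```python
KBYTE = 1024
MBYTE = 1024 * 1024


def on_chip_saving_bytes(overlay_bytes, num_overlays):
    """Memory saved per DSP when one overlay is resident instead of all."""
    return overlay_bytes * (num_overlays - 1)


def swap_bandwidth_mbytes(overlay_bytes, swaps_per_second, num_dsps):
    """Extra external-memory traffic caused by swapping, in Mbytes/s."""
    return overlay_bytes * swaps_per_second * num_dsps / MBYTE


# Hypothetical workload: four 16-Kbyte overlays, 100 swaps/s per DSP.
saving = on_chip_saving_bytes(16 * KBYTE, num_overlays=4)
traffic = swap_bandwidth_mbytes(16 * KBYTE, swaps_per_second=100, num_dsps=8)

print(f"saved per DSP: {saving // KBYTE} Kbytes, added traffic: {traffic} Mbytes/s")
```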
3DSP's SP-5 supports either dual- or single-port memory. For a dual-port memory configuration, the SP-5 can do four memory accesses per cycle: two read or write accesses for A memory and two read or write accesses for B memory. The data memory is further partitioned into four banks. The DSP processes the data from the first bank and stores it to the second bank. The DSP Shuttle transfers the data to the third bank and transfers the data out from the fourth bank. The advantage of the dual-port memory is that there are no memory access restrictions for the DSP core; the trade-off is that the dual-port memory silicon area is twice the size of the single-port memory.
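The four-bank arrangement behaves like two ping-pong buffer pairs. The sketch below shows one plausible rotation, assumed here for illustration: at each frame boundary, the bank the DSP Shuttle just filled becomes the DSP's source, the bank the DSP just wrote becomes the Shuttle's output, and the two freed banks take the remaining roles. The role names are hypothetical labels.

```python
def next_roles(roles):
    """Swap DSP-side and Shuttle-side banks at a frame boundary."""
    return {
        "dsp_src": roles["dma_in"],    # freshly loaded input is processed next
        "dsp_dst": roles["dma_out"],   # drained output bank is free for results
        "dma_in": roles["dsp_src"],    # consumed input bank is free to refill
        "dma_out": roles["dsp_dst"],   # finished results are transferred out
    }


roles = {"dsp_src": 0, "dsp_dst": 1, "dma_in": 2, "dma_out": 3}
for frame in range(3):
    print(f"frame {frame}: {roles}")
    roles = next_roles(roles)
```

In every frame the DSP and the DSP Shuttle touch disjoint banks, so computation and data transfer can overlap without contention.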
For single-port memory, the memory is further partitioned into four odd and even banks. If both instructions of the DSP core try to access the same memory bank, the DSP core is stalled by one cycle. The system architects explored that possibility by using 3DSP's cycle simulator to profile the performance of the key algorithms using single-port memory. We found that there was no impact for common DSP algorithms such as FFT or FIR. For a more complicated voice compression algorithm such as G.729A+B, the performance penalty from memory conflicts is about 8 to 9 percent.
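The kind of profiling described here can be approximated with a toy conflict model. The sketch below is not 3DSP's cycle simulator; it assumes banks are selected by the low-order address bits and charges one extra cycle whenever the two accesses issued in a cycle fall in the same bank. The access patterns are hypothetical.

```python
def run_cycles(access_pairs, num_banks=4):
    """Count cycles for a stream of paired memory accesses.

    Each pair is the two addresses issued in one cycle. A pair that
    maps to the same bank costs one additional stall cycle.
    """
    cycles = 0
    for addr_a, addr_b in access_pairs:
        cycles += 1
        if addr_a % num_banks == addr_b % num_banks:
            cycles += 1  # bank conflict: one-cycle stall
    return cycles


# Neighboring addresses always land in different banks: no stalls.
sequential = [(i, i + 1) for i in range(0, 64, 2)]
# A stride equal to the bank count always collides: every cycle stalls.
strided = [(i, i + 4) for i in range(32)]

print(run_cycles(sequential), run_cycles(strided))
```

Regular kernels tend to resemble the first pattern, while code with irregular access patterns picks up occasional conflicts, which is consistent with the difference the profiling found between simple kernels and a voice codec.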
After weighing the performance and the area cost, the system designers chose to use single-port memory.
After the multicore chip is specified, chip implementation becomes the next big challenge. Given the large scale and complexity of the design, it is impossible to design the whole chip from scratch; proven, mature IP becomes the critical enabler. In our design, we used 3DSP's SP-5flex IP core, DSP Shuttle and a number of other peripheral devices. We also introduced a third-party DDR DRAM controller from our Masterpiece partner Denali, which designed the interface for the DDR DRAM to the DSP Shuttle.
Algorithm developers and hardware chip designers worked together using 3DSP's HiFI configuration tools to generate custom instructions for the SP-5flex DSP core, doubling performance for certain applications.
We adopted a hierarchical layer IP design methodology, wherein each IP block is designed and verified independently. First, the DSP Shuttle was tested alone with all the configurations using a random test methodology. It was then used as an integration interface to tie together all the other IP blocks. The interfaces between the IP blocks and the DSP Shuttle were then tested and verified.
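The random test methodology can be illustrated with a small scoreboard-style test. In the sketch below, `ToyBus` is a hypothetical stand-in for the block under test (in practice an RTL simulation of the bus), and the test drives it with seeded random transactions while an independent scoreboard records what each destination port should receive.

```python
import random
from collections import deque


class ToyBus:
    """Hypothetical stand-in for the design under test: one FIFO per port."""

    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]

    def send(self, src, dst, payload):
        self.queues[dst].append((src, payload))

    def receive(self, dst):
        return self.queues[dst].popleft() if self.queues[dst] else None


def random_test(num_ports=4, num_transactions=1000, seed=1):
    """Drive random traffic and check delivery against a scoreboard."""
    rng = random.Random(seed)  # fixed seed makes any failure reproducible
    bus = ToyBus(num_ports)
    expected = [deque() for _ in range(num_ports)]
    for _ in range(num_transactions):
        src, dst = rng.randrange(num_ports), rng.randrange(num_ports)
        payload = rng.getrandbits(32)
        bus.send(src, dst, payload)
        expected[dst].append((src, payload))
    checked = 0
    for dst in range(num_ports):
        while expected[dst]:
            assert bus.receive(dst) == expected[dst].popleft()
            checked += 1
        assert bus.receive(dst) is None  # nothing spurious was delivered
    return checked


print(random_test())
```

Rerunning the same test across every bus configuration and many seeds is what gives a block-level result enough weight to be trusted once the block becomes the integration backbone.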
Top-level simulation is very slow because the scale of the design is greater than what most of the design CAD tools are designed for. Our layered IP design and verification methodology, using proven IP, can significantly reduce the design risk and design time. In this application, it took the design team two months to design and verify each new peripheral device and another month to integrate all peripheral devices to the DSP Shuttle.
Physical layout
The physical layout and speed closure are the most challenging parts of the design. For a typical high-performance DSP application, the speed of the DSP core and the memory system is very high. Because of the high gate count of the multiprocessor system, a flat place and route cannot meet the speed requirement. We partitioned the entire system into two groups. First, each DSP core was laid out with its local memory. Next, we floor-planned all sub-blocks of the DSP subsystem according to the critical path and routing congestion.
The DSP Shuttle is grouped with all other peripheral devices. At the chip top level, there are no logic gates except routing wires. The DSP Shuttle speed is limited by the physical RC delay. Because the DSP subsystem can run at a higher speed than the DSP Shuttle, an asynchronous interface was implemented for the port between the DSP subsystem and the DSP Shuttle. At the chip top level, the DSP subsystem is replicated multiple times and combined with the DSP Shuttle and peripheral devices.
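The article does not say how the asynchronous interface was built. One common technique for passing data between unrelated clock domains, assumed here purely for illustration, is a FIFO whose read and write pointers are Gray-coded, so that only one bit changes per increment and a pointer sampled mid-transition is never off by more than one position. The sketch below shows the encoding and checks that property.

```python
def bin_to_gray(n):
    """Convert a binary count to its Gray-code equivalent."""
    return n ^ (n >> 1)


def gray_to_bin(g):
    """Recover the binary count from a Gray-code value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


DEPTH = 16  # hypothetical pointer range
for i in range(DEPTH):
    current = bin_to_gray(i)
    following = bin_to_gray((i + 1) % DEPTH)
    changed_bits = bin(current ^ following).count("1")
    print(f"{i:2d} -> {current:04b} (bits changed to next: {changed_bits})")
```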
The system-on-chip design requires a different set of skills and methodology from the traditional VLSI chip design. Instead of designing and optimizing a specific ASIC functionality or a processor core, the design focuses on integrating multiple processor cores and other IP blocks, analyzing the interprocessor and IP core communication protocols, and software/hardware interaction and trade-off. The end system application requirement, the board configuration, the technology and the power requirement will heavily influence the chip architecture specification.
The verification requirement shifts from verifying individual ASIC blocks to verifying the multiprocessor IP core protocols, to avoid deadlock and improve efficiency. Layered IP design and verification methodology enables designers to achieve a nine-month design cycle from the completion of the system specification to tape-out (four months for the logic design and five months for place and route). Chip design has transformed from pure hardware design to system engineering, including end system analysis, board design, hardware/software co-design, logic design using high-level RTL language, and physical layout.
---
Kan Lu, chief technology officer and co-founder of 3DSP Corp. (Irvine, Calif.), applies his DSP and MPU background from Texas Instruments and Digital Equipment. Lu earned BS and MS degrees in EE and computer science from the Massachusetts Institute of Technology.
http://www.isdmag.com
© 2001 CMP Media LLC.
10/1/01, Issue # 13148, page 30.